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Abstract — The variance component tests used in genome- 
wide association studies of thousands of individuals become 
computationally exhaustive when multiple traits are analysed in 
the context of omics studies. We introduce two high-throughput 
algorithms — CLAK-Chol and CLAK-Eig — for single and mul- 
tiple phenotype genome-wide association studies (GWAS). The 
algorithms, generated with the help of an expert system, reduce 
the computational complexity to the point that thousands of traits 
can be analyzed for association with millions of polymorphisms in 
a course of days on a standard workstation. By taking advantage 
of problem specific knowledge, CLAK-Chol and CLAK-Eig 
significantly outperform the current state-of-the-art tools in both 
single and multiple trait analysis. 

Current biomedical research is experiencing a large boost 
in the amount of data generated. Individual genomes are 
being characterized at increased level of details using single 
nucleotide polymorphism (SNP) arrays, and, more recently, 
exome and whole-genome re-sequencing. At the same time, 
technologies for high-throughput characterization of tens of 
thousands of molecular omics phenotypes in thousands of 
people are becoming increasingly affordable [1] [2], [3] [4]. 

Genome-Wide Association Studies (GWAS) is a powerful 
tool for identifying loci involved in the control of complex 
traits [5]. In such studies, thousands of individuals are mea- 
sured for the trait of interest and their genomes are character- 
ized by re-sequencing or by genotyping of millions of genetic 
markers using SNP arrays. The association between genetic 
markers and a phenotype of interest is studied, with signif- 
icant association highlighting the genomic regions harboring 
functional variants involved in the control of the trait. 

Even in carefully designed population-based studies, some 
degree of relatedness and population stratification is expected. 
One of the most flexible and powerful methods of accounting 
for such substructure is the Variance Components (VC) ap- 
proach based on linear mixed models [6], [7] (Supplementary 
Note, Section 1). Recent FaST-LMM [8] implementation of 
the VC model can complete genome-wide scan of association 
between 36 millions of genetic markers and a phenotype 
in 1,000 individuals in about 4 hours using a multicore(12) 
processor. While these results are impressive and represent a 
breakthrough compared to older algorithms and implementa- 
tions, such computational throughput is still prohibitively low 
for analysis of omics data. For example, for transcriptome 
(30,000 phenotypes), the estimated time to complete the analy- 



sis would be approximately 13 years. Currently, such amount 
of computations would only be tractable if large computing 
facilities with many thousands of cores are used. 

In this work, we consider the problem of genome-wide 
association analysis using the mixed models, with special 
emphasis on analysis of thousands of phenotypes (Fig. 1). We 
demonstrate how utilization of problem-specific knowledge, 
coupled with innovative techniques of automatic generation 
of linear algebra algorithms and hardware-tailored implemen- 
tation, allows reducing the computation time from years to 
days. For the aforementioned analysis of transcriptome, our 
algorithm takes 26 days on a single node, or equivalently, 20 
hours on a standard 32-node cluster. 

In the context of linear algebra computations, we recently 
developed a symbolic system, CLAK, that closely mirrors the 
reasoning of a human expert for the generation and analysis 
of algorithms [9]. The idea is to first decompose a target 
linear algebra operation in terms of library-supported kernels, 
and then apply optimizations aimed at reducing redundant 
calculations. Since the decomposition is not unique, CLAK 
returns not one but a family of algorithms, together with their 
corresponding cost estimates. 

Figure 1 illustrates how multi-trait analysis consists of t sep- 
arate single-trait analyses, each of which, in turn, consists of 
m generalized least-squares (GLS) problems. The key to fast 
algorithms is the realization that such problems are correlated, 
both in the m and t directions; a naive approach that only aims 
at optimizing one GLS in isolation will never be competitive 
with methods that tackle sequences as a whole. With the 
help of CLAK, we generated more than 20 algorithms for 
performing genome-wide association analysis. Based on the 
cost estimates, we assessed the potential of these algorithms 
for single-trait and multiple-trait studies. Interestingly, for the 
two scenarios, the best theoretical performance was attained 
by two different algorithms, CLAK-CHOL and CLAK-ElG, 
respectively (Supplementary Note, Sections 2 and 3). 

Both algorithms incorporate a number of techniques for 
increasing the computational efficiency and save intermediate 
results across adjacent problems. In CLAK-CHOL, differently 
than in the other existing methods (such as EMMAX [10] and 
FaST-LMM), the positive-definiteness of the covariance matrix 
is exploited to utilize a faster factorization; moreover, the 
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Figure 1: Interpretation of GWAS as a 2-dimensional sequence of generalized least-squares problems (b := 
(X T M~ X X) X M _1 y). GWAS with multiple phenotypes requires the solution of mxt correlated GLS problems, originating 
a three-dimensional object B of size m x t x p. Along the t direction, the covariance matrix M and the phenotype y vary, 
while the design matrix X does not; conversely, in the m direction, M and y are fixed while X varies. Specifically, X can 
be viewed as consisting of two parts, Xl and Xr, where the former is constant across the entire grid and the latter changes 
along m. The figure also captures GWAS with single phenotype, in which case the dimension t reduces to 1. 



computation is rearranged to fully benefit from the potential 
of the underlying optimized kernels. In CLAK-ElG instead, 
redundant computation is avoided by exploiting the fact that 
the relationship (kinship) matrix is constant even across dif- 
ferent traits, and only the heritability and total variance of the 
trait change. When compared with the current state-of-the- 
art implementations such as FaST-LMM, both our algorithms 
achieve a lower computational complexity (for a comparison, 
see Supplementary Table 1). 

The size of the data sets involved in genome-wide associ- 
ation studies is considerably larger than the memory capacity 
of current processors; input and output data can only be 
stored in disk devices. Since the penalty for accessing a 
piece of data residing on disk is enormous -several order of 
magnitude greater than the cost for performing one arithmetic 
operation- it is imperative to efficiently handle the data. From 
our experiments, in fact, we observed that the time spent on 
data movement adds about 30% to the time spent on arithmetic 
calculations. Instead, through asynchronous transfers between 
memory and disk, our algorithms achieve a perfect overlap of 
computation and data movement. As long as the covariance 
matrix fits in main memory, and regardless of the size of 
the data sets -both in terms of SNPs and phenotypes-, the 
processor never idles waiting for data, thus computing at 
maximum efficiency. 

To demonstrate the practical advantages of CLAK-ElG and 
CLAK-Chol, we implemented routines and compared their 
execution time with that of well-established methods: EM- 
MAX, FaST-LMM (two-step approximation), and GWFGLS 
(implementation of the mmscore method of ProbABEL [11] 
in the MixABEL-package). In the experiments we considered 
three different scenarios, varying the sample size, the number 
of SNPs, and the number of traits, while keeping the other 
two values constant (Fig. 2). A description of the experimental 
setup is provided in the Supplementary Note, Section 4. 

In the first scenario (single trait and 10,000,000 SNPs), even 



though all methods exhibit a quadratic behaviour, CLAK- 
Chol is the only algorithm that completed all tests within 
1.5 days. For the largest problem considered (sample size 
n = 40,000), the speedup over FaST-LMM is 4.53: 158 hours 
vs. 35; for n — 1,000 instead, the speedups over GWFGLS, 
FaST-LMM and EMMAX are 15, 28 and 106, respectively: 
38, 68 and 257 minutes vs. 2.5 minutes. 

The second scenario (single trait and sample size of 10,000) 
shows a linear dependence on the number of genetic markers 
for all methods. Again, CLAK-CHOL attains the best timings, 
outperforming FaST-LMM, GWFGLS and EMMAX by a 
factor of 6.3, 56.8 and 112, respectively. 

Thanks to CLAK-ElG's linear complexity with respect to 
sample size, SNPs, and traits, the advantage in the analysis 
of multiple phenotypes (third scenario: sample size of 1,000 
and 1,000,000 genetic markers) becomes most apparent: when 
thousands and more traits are considered, CLAK-ElG outper- 
forms GWFGLS, FaST-LMM and EMMAX by a factor of 
105, 177, and 418, respectively, bringing the execution time 
from several months down to two days. 

Establishing the functional roles of genetic variants and 
finding the biological link between genetic variation and 
complex phenotypes remain significant challenges in the post- 
genomic era. Genome-wide association studies are increas- 
ingly applied to understand the regulation of human and 
animal transcriptome [1], metabolome [2], [3], glycome [4] 
and other types of omics data; they are also used to uncover 
the link between these molecular phenotypes and high-level 
complex traits, including common diseases [12]. In GWAS, it 
is well recognized that genetic (sub)structure can act as con- 
founder and lead to false-positive discoveries, unless correctly 
accounted for; one of the most flexible and powerful methods 
for such a correction is provided by the mixed models [6]. 
However, this powerful method comes at a high computational 
price: Even when using the most advanced methods and 
tools [8], it takes a few hours for the GWAS to complete. 
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Figure 2: Timing comparison. Panels (a) and (b) include timings for EMMAX, GWFGLS, FaST-LMM, and CLAK-CHOL, 
relative to single trait analysis; (c) and (d) present a comparison of EMMAX, GWFGLS, FaST-LMM, and CLAK-ElG in the 
case of multiple traits. In (a), the number of SNPs is fixed to m — 10,000,000 and the sample size n ranges from 1,000 to 
40,000. In panel (b), the sample size is fixed to n — 10,000 and the number of SNPs m ranges between 10 6 and 3.6 x 10 7 . 
In (c) and (d), n = 1,000, m = 10 6 and t ranges from 1 to 100,000. 



While this time scale might not seem a big hindrance when 
only a relatively small number of phenotypes are analyzed (for 
instance in studies of complex traits and common diseases), the 
issues become apparent as soon as omics data are considered. 
The computational time becomes prohibitively large (up to 
several years), and in order to complete the studies within 
reasonable time one has to count on supercomputers with 
thousands of cores. This solution delays the research and 
increases its cost. 

In this work, we clearly demonstrate that use of problem- 
specific knowledge, coupled with innovative techniques for 
algorithm generation and with hardware-tailored implementa- 
tions, leads to both a decrease in computational complexity and 
an increase in algorithmic efficiency. Specifically, for the anal- 
ysis of omics data we were able to attain remarkable speed- 
ups, making it feasible to analyze — on a single standard multi- 
core computer, as opposed to supercomputing facilities — tens 
and even hundreds of thousands of phenotypes in the course 
of few days rather than years. 

Further optimizations are possible, for instance by exploit- 
ing the structure of the kinship matrix. A compressed MLM 



approach was proposed for decreasing the effective sample 
size of datasets by clustering individuals into groups [13]; 
similarly, the fast decaying and possibly sparse structure of 
the kinship matrix can be exploited to decrease the algorithmic 
complexity. 

We believe that the approach outlined here — integrating 
problem-specific knowledge with automatic algorithm gen- 
eration and hardware-tailored implementation — has many 
applications in high-throughput analysis of biomedical data. 

The routines CLAK-CHOL and CLAK-ElG 
are available as Supplementary Software and at 
hpac . rwth-aachen . de/CLAK-GWAS/. 
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1 Mixed Models for GWAS 

The Variance Components model for a quantitative trait can be formulated as 

Y = X/3 + R, 

where Y is the vector containing the phenotypes for n individuals, X is the design 
matrix, and (3 and R are the vectors of fixed and random effects, respectively. The 
partitioning X = [l\L\g] indicates that the design matrix is composed of three parts: 
1 denotes a column-vector (corresponding to the intercept) containing ones, L is an 
n x p matrix corresponding to fixed covariates such as age and sex, and g typically 
consists of a single column-vector containing genotypes. The vector of random effect R 
is assumed to be distributed as a Multivariate Normal with mean zero and variance- 
covariance matrix M = a 2 ■ (/i 2 $ + (1 — h 2 )I); here, a 2 is the total variance of the 
trait, h 2 (in the range [0, 1]) is the heritability coefficient -which defines the strength of 
the correlation between phenotypes of relatives-, / is the identity matrix, and $ is the 
matrix containing the relationship coefficients for all pairs of studied individuals. The 
relationship coefficient is defined as the proportion of the genome identical-by-descent 
that a pair of individuals share; for example, in the case of identical twins (they have 
the same DNA) the relationship coefficient is 1, while since a parent transmits half of 
its genome to the offspring, the relationship coefficient for a parent with its offspring is 
0.5. The relationship matrix $ can be estimated from the pedigree or from the genomic 
data [1]. In GWAS, the quantity of interest is the effect of the genotype, that is, the 
element (s) of (5 corresponding to g. Technically, a GWAS with mixed model consists of 
traversing all measured polymorphic sites in the genome, substituting the corresponding 
g into X, and fitting the above model; the result is millions of estimates of genetic effect 
together with their p-values. 



One of the most used mixed model-based approaches used in GWAS relies on a two- 
step analysis methodology [2, 3, 4, 5, 6]. In the first step, the reduced model (with X = 
[1|L]) is fit to the data, thus obtaining the estimates {a 2 ,h 2 }; the variance-covariance 
matrix corresponding to such estimates is denoted by M = a 2 • (/r$ + (1 — h 2 )I). In 
the second step, for each g { and corresponding X i = [l\L\gi], the estimates of the effects 
and the variance-covariance matrix are respectively obtained as 

k = {XjM^X^XjM-'y, 

and 

VarCfr) = o 2 ■ (XfM^Xi)- 1 , 

with i = 1, ... ,m, and m is the number of genetic markers considered. 

In this work, we consider an extended formulation of this problem to the case of 
multiple phenotypes, that is, Y is a collection of t vectors, with t/j (j — 1, . . . , t) being a 
vector corresponding to a specific trait. In this case, trait-specific estimates a 2 , h 2 need 
to be obtained, resulting in t different MjS. As the result of the analysis, m x t vectors 
of estimates of Pij and corresponding Var((3ij) are generated. In summary, the problem 
we are facing is (see Fig. 1 of the main manuscript text for a visual description) 

4- = ( Xj M~ 1 X; ) ~ 1 Xj M~ 1 yj 
M 3 = oj(hp + (1 - h))I) 

2 Single-instance Algorithms 

In this section we provide an overview of a simplified version of CLAK-CHOL and 
CLAK-ElG to solve a single GLS; the versions tailored for sequences of GLS's are then 
introduced in Section 3. In the following, we use b to indicate (3; in both our algorithms, 
the computation of Var{b) represents an intermediate result towards b. 

CLAK-Chol The approach for CLAK-Chol (Alg. 1) is to first reduce the initial 
GLS b := (X 7 M~ 1 X)~ 1 X T M~ x y to a linear least-squares problem, and then solve this 
via normal equations. Specifically, the algorithm starts by forming M = a 2 • {h 2 <& + 
(1 — h 2 )I), which is known to be symmetric positive definite, and by computing its 
Cholesky factor L. This leads to the expression b := (X T L~ T L~ 1 X)~ 1 X T L~ T L~ 1 y, 
in which two triangular linear systems can be identified and solved — X' := L~ 1 X 



(1) 



with 1 < i < m 
and 1 < j < t. 



(2) 



and y' := L~ l y — thus completing the reduction to the standard least-squares problem 
b := (X' T X')~ l X' T y' . Numerical considerations allow us to safely rely on the Cholesky 
factorization of S := X' T X' without incurring instabilities. The algorithm completes 
by computing b' := X' T y' and solving the linear system b := S^ 1 ^, for a total cost of 
|n 3 + 0(n 2 p) for a single GLS. 



Algorithm 1: CLAK-Chol's approach 
for a single GLS problem 



! M := ct 2 • (h 2 <S> + (1 - h 2 )I) 

2 LL T = M 

3 X:^L- 1 X 

4 y := 

6 5 := X T X 

6 b := X T y 

7 6:=5 -1 6 



CLAK-ElG Instead of forming the matrix M, Algorithm CLAK-ElG (Alg. 2) oper- 
ates on the matrix $: At first, it diagonalizes $ as ZWZ T , leading to the expression 

M := a 2 (h 2 ZWZ T + (1 - h 2 )I), 

with diagonal W. By orthogonality of the inverse of M can be represented as 

M" 1 := Z(a 2 (h 2 W + (1 - h 2 )I))~ 1 Z T 

and easily computed via 

D : = (a 2 (/^W + (1 - h 2 )!))- 1 ; 

the solution to the GLS is thus given by 

b := (X T ZD- 1 Z T X)- 1 X T ZD~ 1 Z T y. (3) 

Moreover, since D is symmetric positive definite, Eq. 3 can be rewritten as 

b := (X T ZKK T Z T X)- l X T ZKK T Z T y, 

and the algorithm proceeds by computing V := X T ZK (matrix-matrix multiplication 
and scaling), obtaining b := (VV T )~ 1 V K 7 Z T y. Similar to CLAK-Chol, b is finally 
obtained through matrix-vector multiplications and a linear system, for a total cost of 
fn 3 + 0(n 2 p). 
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3 Sequences of GLS' 

As shown in Table 1, the strength of CLAK-Chol and CLAK-ElG becomes apparent 
in the context of ID and 2D sequences of GLS', corresponding to GWAS with single 
and multiple phenotypes, respectively. The straightforward approach, which is the only 
alternative provided by current general-purpose numerical libraries, lies in a loop that 
utilizes the best performing algorithm for one GLS. No matter how optimized the GLS 
solver, such an approach is prohibitively expensive for either single or multiple pheno- 
types, due to the unmanageable complexity of 0(mn 3 ) and 0(tmn 3 ), respectively. By 
contrast, the versions of CLAK-Chol and CLAK-ElG shown in Fig. ?? are the prod- 
uct of a number of optimizations aimed at saving intermediate results across successive 
problems, thus avoiding redundant calculations. For instance, the matrix X is logically 
split as LYfjXft], where X L includes the intercept and the covariates ([1|L]), while X R 
is the collection of all the genetic markers Thanks to these savings, both algorithms 
achieve a lower overall complexity (Table 1). 



Algorithm 3: CLAK-Chol 
single-trait analysis 



Algorithm 4: CLAK-Eig for 
multi-trait analysis 
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Unfortunately, the reduced complexity of algorithms CLAK-Chol and CLAK-ElG 
is not enough to guarantee high-performance implementations. It is well known that 
in terms of execution time, a straightforward translation of algorithms into code and 
a carefully assembled routine often differ by at least one order of magnitude. In other 
words, the benefits inherent to our new algorithms might go unnoticed unless paired 
with state-of-the-art implementation techniques. In this section we detail our strategy 
to attain high-performance routines. 

Both CLAK-Chol and CLAK-Eig are entirely expressed in terms of standard 
linear algebra operations such as matrix products and matrix factorizations, provided 
by the BLAS [7] and LAPACK [8] libraries. Since LAPACK itself is built in terms of 
BLAS kernels, these are the main responsible for the overall performance of an algo- 
rithm. BLAS consists of a relatively small set of highly optimized kernels, organized 
in three levels, corresponding to vector, vector-matrix, and matrix-matrix operations, 
respectively. A common misconception is that all the BLAS kernels, across the three 
levels, attain a comparable (and high) level of efficiency. Instead, it is only BLAS-3 
-when operating on large matrices- that fully exploits the processors' potential; as an 



example, the matrix-vector multiplication, matrix-matrix multiplication on small ma- 
trices, and matrix-matrix-multiplication on large matrices attain an efficiency of 10%, 
~ 40%, and more than 95%, respectively. In this context, the linear systems X := L~ 1 X 
in CLAK-Chol (Line 3 of Algorithm 1), to be solved for each individual SNP, should 
ideally be aggregated into a single — very large — linear system Xr := L~ 1 Xr, in which 
X R is the collection of the genetic markers of all SNPs (Line 3 of Algorithm 3). An 
analogous comment is valid for CLAK-ElG (see the multiplications at Lines 3 and 4 of 
Algorithm 4), in which the number of markers accessed at once is a user-configurable 
input parameter. 

Since all the current computing platforms include multicore processors, we now 
briefly discuss how to take advantage of this architecture. An effective practice is to 
invoke a multi-threaded version of the BLAS library. Both in Algorithm 3 and 4, we 
rely on such a solution for the sections leading up to the outer loop (Lines 1-6 and 
1-4, respectively). In the remainder of the algorithms (Lines 7-11 and 5-13), due to 
the lack of matrix-matrix operations, we instead apply a thread-based parallelization 
in conjunction with single-threaded BLAS. This mixed use of multi-threading becomes 
more and more effective as the number of available computing cores increases, leading 
to speedups up to 10 or even 20%. 

3.1 Space requirement 

CLAK-Chol To form and factor the variance matrix, the algorithm uses n 2 memory 
(Lines 1-2 in Algorithm 3). The overall space requirement is determined by the triangu- 
lar solve (Line 4), which necessitates the full L and a portion of X to reside in memory 
at the same time. This operation is performed in a streaming fashion — operating on k 
SNPs at the time — and overwriting X. All the other instructions do not require any 
extra space. In total, CLAK-Chol uses about n 2 + knp memory. 

CLAK-ElG The initial eigendecomposition (Line 1 in Algorithm 4) needs 2n 2 mem- 
ory. The following matrix-matrix multiplications (Lines 2, 3 and 8) overwrite the input 
matrices and again are performed in a streaming fashion; in terms of space, the analysis 
is similar to that for the triangular solve in CLAK-Chol. The remaining instructions 
do not affect the overall memory usage. In total, the space requirement is 2n 2 + knp. 

If memory is a limiting factor, one can set fc to 1 in either algorithm, possibly at 
the cost of exposed data movement. Moreover, by using a different eigensolver, the 
eigenvectors Z can overwrite the input matrix $, effectively saving n 2 memory. These 
considerations indicate that our algorithms are capable of solving GWAS of any size, as 
long as the kinship matrix and one SNP fit in RAM. 



3.2 Time complexity 

As it was described in Sec. 1, before the sequence of GLS can be solved, one needs to 
estimate the parameters related to the random part of the model, namely the heritability 
(h 2 ) and the total variance (a 2 ). Software packages such as GenABEL [9], EMMAX, and 
FaST-LMM use algorithms similar to our CLAK-ElG. First, the eigendecomposition 
of the kinship matrix is performed, for a computational cost of 0(n 3 ). Then, the model 
parameters are estimated using an iterative procedure based on the Maximum Likelihood 
(GenABEL, FaST-LMM) or Restricted Maximum Likelihood (EMMAX) methods, for a 
cost of vnp 2 , where v is the average number of iterations required to reach convergence. 
In total, the parameter estimation has a complexity of |n 3 + 0(vnp 2 ) operations. 

By contrast, when our CLAK-Chol algorithm is used, the initial estimation re- 
quires a Cholesky factorization of the kinship matrix at each of the v iterations, for a 
total cost of \v n 3 operations. Due to the reduced cost and better performance of such 
a factorization, this strategy is advantageous when v <~ 30, a condition commonly 
verified in practice. 

In the second step, a ID or 2D sequence of GLS is solved, corresponding to single 
and multiple phenotype analysis. For ID sequences, all the considered methods share 
the same asymptotic time complexity, but the constant factor for CLAK-Chol is the 
lowest, yielding at least a 4-fold speedup. In 2D sequences, EMMAX, FaST-LMM 
and GenABEL simply tackle each individual trait independently, one after another, 
for a total complexity of tmpn 2 . By exploiting the common structure of the variance- 
covariance matrix of different phenotypes, CLAK-ElG reduces the complexity by a 
factor of n, down to tmpn. As a result, our algorithm outperforms the other methods 
by a factor of 100 and more. 

4 Computing environment 

All the computing tests were run on a SMP system consisting of two Intel Xeon X5675 
6-core processors, operating at a frequency of 3.06 GHz. The system is equipped with 
96GB of RAM and 1TB of disk as secondary memory. 1 The routines were compiled with 
the GNU C Compiler (gcc v4.4.5) and linked to Intel's MKL multi-threaded library 
(vl0.3). For double-buffering, our out-of-core routines make use of the AIO library, 
one of the standard libraries on UNIX systems. The available multi-core parallelism is 
exploited through MKL's multi-threaded BLAS and OpenMP's pragma directives. 



1 Rccall that our algorithms do not require large amounts of available memory: As long as the kinship 
matrix fits in memory, they will complete. 
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in n 

ID sequence 


ZD sequence 


Naive 




0(mn 3 ) 


0(mtn 3 ) 


CLAK-Chol 


CHOL-BASED 


0(mpn 2 ) 




CLAK-Eig 


EIGEN-BASED 




O(tmpn) 


FaST-LMM 


EIGEN-BASED 


0(mpn 2 ) 


0(tmpn 2 ) 


GWFGLS 


EIGEN-BASED 


0{mpn 2 ) 


0{tmpn 2 ) 


EMMAX 


EIGEN-BASED 


0(mpn 2 ) 


0(tmpn 2 ) 



Table 1: Computational costs for single GLS, as well as 1-dimensional and 2-dimensional 
sequences of GLSs. The variables n,m and t denote the sample size, the number of 
genetic markers, and the number of traits, respectively, p is the width of the X matrix, 
which is determined by the number of fixed covariates used (e.g. age and sex) and the 
genetic model assumed. 

4.1 Simulated data 

A data set for GWAS can be characterized by the number of individuals in the sample 
(n), the number of measured and/or imputed SNPs (m) to be tested, the number of 
outcomes to be analyzed (t), and the number of covariates (p) to be included in the 
model. In current GWAS, a typical scenario consists of a few covariates (for example, 
two, such as sex and age), 10 5 10 7 SNPs, and thousands or tens of thousands indi- 
viduals; also, only one or a limited number of outcomes are studied. In our experiments, 
we assumed that the number of covariates is two (p=2), and we varied the three other 
characteristics (n, m, t) of the data set leading to these three scenarios. 

(A) The number of SNPs was fixed to m = 10,000,000 and one single outcome (t = 1) 
was studied. As sample size, we used 1,000, 5,000, 10,000, 20,000, and 40,000. The 
latter test represents a scenario with large number of individuals. 

(B) The number of individuals was fixed to n — 10, 000 and one single outcome (t = 
1) was studied. The number of markers m to be analyzed was set to 1,000,000, 
10,000,000, and 36,000,000. The latter test is a scenario that represents a whole 
genome resequencing. 

(C) The number of individuals and markers were fixed to n = 1,000 and m = 1,000,000, 



respectively. The number of outcomes t studied varied (1, 10, 100, 1,000, 10,000, 
and 100,000), corresponding to an Omics analysis. 

For testing purposes, we generated artificial data sets which met pre-specified values 
of t, m, p, and n. 
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